About Data Analysis Report

This RMarkdown file contains the report of the data analysis carried out for the project on forecasting daily bike rental demand using time series models in R. It covers data exploration, summary statistics, and the building of time series models.

Data Description:

The dataset contains the daily count of rental bike transactions for the years 2011 and 2012 in the Capital Bikeshare system, along with the corresponding weather and seasonal information.

The dataset has the following fields:

- instant: record index
- dteday : date
- season : season (1:winter, 2:spring, 3:summer, 4:fall)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
- weekday : day of the week
- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0.
- weathersit :
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
- atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
- hum: Normalized humidity. The values are divided by 100 (max)
- windspeed: Normalized wind speed. The values are divided by 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered

Data Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
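As a side note, the normalization formulas listed above can be inverted to recover approximate raw units. The following is a minimal sketch (assuming `day.csv` has been read into a data frame as in the code below; the `*_c`, `*_pct`, and `*_raw` column names are illustrative, not part of the dataset):

```r
# Sketch: invert the dataset's normalization to recover raw units.
dfd <- read.csv("day.csv")

# temp was scaled via (t - t_min) / (t_max - t_min) with t_min = -8, t_max = 39
dfd$temp_c <- dfd$temp * (39 - (-8)) + (-8)       # degrees Celsius
# atemp used t_min = -16, t_max = 50
dfd$atemp_c <- dfd$atemp * (50 - (-16)) + (-16)   # degrees Celsius
# hum was divided by 100, windspeed by 67
dfd$hum_pct <- dfd$hum * 100                      # percent
dfd$windspeed_raw <- dfd$windspeed * 67

summary(dfd[, c("temp_c", "atemp_c", "hum_pct", "windspeed_raw")])
```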

Load and explore the data

Install packages and load the data

## Import required packages
install.packages("timetk")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("lubridate")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)

Loading Libraries

library(timetk)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(TTR)
library(tseries)  # For adf.test()

Loading Data

# Read the CSV files
dfd <- read.csv("day.csv")

Describe and explore the data

print("Structure of dfd")
## [1] "Structure of dfd"
str(dfd)
## 'data.frame':    731 obs. of  16 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : chr  "2011-01-01" "2011-01-02" "2011-01-03" "2011-01-04" ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : int  6 0 1 2 3 4 5 6 0 1 ...
##  $ workingday: int  0 0 1 1 1 1 1 0 0 1 ...
##  $ weathersit: int  2 2 1 1 1 1 2 2 1 1 ...
##  $ temp      : num  0.344 0.363 0.196 0.2 0.227 ...
##  $ atemp     : num  0.364 0.354 0.189 0.212 0.229 ...
##  $ hum       : num  0.806 0.696 0.437 0.59 0.437 ...
##  $ windspeed : num  0.16 0.249 0.248 0.16 0.187 ...
##  $ casual    : int  331 131 120 108 82 88 148 68 54 41 ...
##  $ registered: int  654 670 1229 1454 1518 1518 1362 891 768 1280 ...
##  $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...
print("Summary of dfd")
## [1] "Summary of dfd"
summary(dfd)
##     instant         dteday              season            yr        
##  Min.   :  1.0   Length:731         Min.   :1.000   Min.   :0.0000  
##  1st Qu.:183.5   Class :character   1st Qu.:2.000   1st Qu.:0.0000  
##  Median :366.0   Mode  :character   Median :3.000   Median :1.0000  
##  Mean   :366.0                      Mean   :2.497   Mean   :0.5007  
##  3rd Qu.:548.5                      3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :731.0                      Max.   :4.000   Max.   :1.0000  
##       mnth          holiday           weekday        workingday   
##  Min.   : 1.00   Min.   :0.00000   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 4.00   1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.000  
##  Median : 7.00   Median :0.00000   Median :3.000   Median :1.000  
##  Mean   : 6.52   Mean   :0.02873   Mean   :2.997   Mean   :0.684  
##  3rd Qu.:10.00   3rd Qu.:0.00000   3rd Qu.:5.000   3rd Qu.:1.000  
##  Max.   :12.00   Max.   :1.00000   Max.   :6.000   Max.   :1.000  
##    weathersit         temp             atemp              hum        
##  Min.   :1.000   Min.   :0.05913   Min.   :0.07907   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:0.33708   1st Qu.:0.33784   1st Qu.:0.5200  
##  Median :1.000   Median :0.49833   Median :0.48673   Median :0.6267  
##  Mean   :1.395   Mean   :0.49538   Mean   :0.47435   Mean   :0.6279  
##  3rd Qu.:2.000   3rd Qu.:0.65542   3rd Qu.:0.60860   3rd Qu.:0.7302  
##  Max.   :3.000   Max.   :0.86167   Max.   :0.84090   Max.   :0.9725  
##    windspeed           casual         registered        cnt      
##  Min.   :0.02239   Min.   :   2.0   Min.   :  20   Min.   :  22  
##  1st Qu.:0.13495   1st Qu.: 315.5   1st Qu.:2497   1st Qu.:3152  
##  Median :0.18097   Median : 713.0   Median :3662   Median :4548  
##  Mean   :0.19049   Mean   : 848.2   Mean   :3656   Mean   :4504  
##  3rd Qu.:0.23321   3rd Qu.:1096.0   3rd Qu.:4776   3rd Qu.:5956  
##  Max.   :0.50746   Max.   :3410.0   Max.   :6946   Max.   :8714

Correlation Analysis

Correlation of the normalized temperature and the normalized feeling temperature with the total count of bike rentals:

cor(dfd$temp, dfd$cnt)
## [1] 0.627494
cor(dfd$atemp, dfd$cnt)
## [1] 0.6310657

Mean and Median of temperature by season

Mean and median temperatures for the different seasons (winter, spring, summer, and fall)

aggregate(temp ~ season, data = dfd, FUN = function(x) c(mean = mean(x), median = median(x)))
##   season temp.mean temp.median
## 1      1 0.2977475   0.2858330
## 2      2 0.5444052   0.5620835
## 3      3 0.7063093   0.7145830
## 4      4 0.4229060   0.4091665

Mean values of different factors per month

The mean temperature, humidity, wind speed, and total rentals per month

library(dplyr)
dfd$dteday <- as.Date(dfd$dteday)

monthly_summary <- dfd %>% 
  mutate(month = format(dteday, "%Y-%m")) %>%
  group_by(month) %>%
  summarise(mean_temp = mean(temp, na.rm = TRUE),
            mean_humidity = mean(hum, na.rm = TRUE),
            mean_wind_speed = mean(windspeed, na.rm = TRUE),
            total_rentals = sum(cnt, na.rm = TRUE))
print(monthly_summary)
## # A tibble: 24 × 5
##    month   mean_temp mean_humidity mean_wind_speed total_rentals
##    <chr>       <dbl>         <dbl>           <dbl>         <int>
##  1 2011-01     0.198         0.584           0.195         38189
##  2 2011-02     0.283         0.560           0.229         48215
##  3 2011-03     0.332         0.569           0.232         64045
##  4 2011-04     0.471         0.668           0.244         94870
##  5 2011-05     0.577         0.713           0.181        135821
##  6 2011-06     0.693         0.593           0.178        143512
##  7 2011-07     0.759         0.590           0.172        141341
##  8 2011-08     0.705         0.627           0.191        136691
##  9 2011-09     0.613         0.784           0.153        127418
## 10 2011-10     0.470         0.707           0.176        123511
## # ℹ 14 more rows

Relationship between temperature and bike rentals

Temperature associated with bike rentals (registered vs. casual)

cor(dfd$temp, dfd$registered)
## [1] 0.540012
cor(dfd$temp, dfd$casual)
## [1] 0.5432847

Create interactive time series plots

Temperature across seasons

boxplot(temp ~ season, data = dfd,
        main = "Temperature by Season",
        xlab = "Season",
        ylab = "Normalized Temperature",
        col = c("blue", "green", "red", "yellow"))

Monthly rentals and temperature

library(ggplot2)

# Convert month to a proper date format for plotting
monthly_summary$month <- as.Date(paste0(monthly_summary$month, "-01"))

# Plotting
ggplot(monthly_summary, aes(x = month)) +
  geom_line(aes(y = mean_temp, color = "Mean Temperature")) +
  geom_line(aes(y = total_rentals/100000, color = "Total Rentals (scaled)")) + # Scaling rentals for visibility
  labs(title = "Mean Temperature and Total Rentals per Month",
       x = "Month",
       y = "Value") +
  scale_y_continuous(sec.axis = sec_axis(~ . * 100000, name = "Total Rentals")) +
  theme_minimal()

Interactive time series plot

dfd$date <- as.Date(dfd$dteday)  # Ensure the date column is in Date format
dfd <- dfd %>%
  mutate(year = year(dteday), month = month(dteday))
# Aggregate data by date to get total rentals per day
daily_rentals <- dfd %>%
  group_by(dteday) %>%
  summarise(total_rentals = sum(cnt, na.rm = TRUE))

# Verify correctness
head(daily_rentals)  # Check the first few rows of the aggregated data
## # A tibble: 6 × 2
##   dteday     total_rentals
##   <date>             <int>
## 1 2011-01-01           985
## 2 2011-01-02           801
## 3 2011-01-03          1349
## 4 2011-01-04          1562
## 5 2011-01-05          1600
## 6 2011-01-06          1606
str(daily_rentals)   # Check the structure of the data
## tibble [731 × 2] (S3: tbl_df/tbl/data.frame)
##  $ dteday       : Date[1:731], format: "2011-01-01" "2011-01-02" ...
##  $ total_rentals: int [1:731] 985 801 1349 1562 1600 1606 1510 959 822 1321 ...
# Create an interactive time series plot
plot_time_series(
  .data = daily_rentals,
  .date_var = dteday,
  .value = total_rentals,
  .interactive = TRUE,
  .plotly_slider = TRUE,
  .title = "Total Daily Bike Rentals"
  #.y_label = "Total Rentals"
)

Seasonal Diagnostics

# Seasonal diagnostics
plot_seasonal_diagnostics(
  .data = daily_rentals,
  .date_var = dteday,
  .value = total_rentals
)

Anomaly Diagnostics

# Anomaly diagnostics
plot_anomaly_diagnostics(
  .data = daily_rentals,
  .date_var = dteday,
  .value = total_rentals
)
## frequency = 7 observations per 1 week
## trend = 92 observations per 3 months

Smooth time series data

# Ensure the dteday column is of Date type and total_rentals is numeric
daily_rentals$dteday <- as.Date(daily_rentals$dteday)
daily_rentals$total_rentals <- as.numeric(daily_rentals$total_rentals)

# Step 1: Clean the time series data
# Create a time series object
ts_data <- ts(daily_rentals$total_rentals, frequency = 365, start = c(year(min(daily_rentals$dteday)), yday(min(daily_rentals$dteday))))

# Clean the time series data
cleaned_ts <- tsclean(ts_data)

# Step 2: Apply Simple Exponential Smoothing
ses_model <- HoltWinters(cleaned_ts, gamma = FALSE)  # Simple Exponential Smoothing

# Step 3: Apply Simple Moving Average with order 10
sma_values <- SMA(cleaned_ts, n = 10)

# Step 4: Plotting the results
plot.ts(cleaned_ts, main = "Cleaned Time Series Data", ylab = "Total Rentals", xlab = "Time")
lines(ses_model$fitted[,1], col = "blue", lwd = 2, lty = 2)  # Fitted values from SES
lines(sma_values, col = "red", lwd = 2)  # SMA line
legend("topright", legend = c("Cleaned Data", "SES Fitted", "SMA"), col = c("black", "blue", "red"), lty = c(1, 2, 1), lwd = 2)

Decompose and assess the stationarity of time series data

To decompose the time series data and assess its stationarity, we will follow these steps:

Step 1: Decompose the Time Series
We will use the decompose() or stl() functions to break down the time series into its trend, seasonal, and remainder components.

Step 2: Assess Stationarity
To check whether the time series is stationary, we will inspect the ACF and PACF plots and run the Augmented Dickey-Fuller (ADF) test.

Step 3: Make the Time Series Stationary
If the time series is not stationary, we will apply differencing using the diff() function.

# Create a time series object
ts_data <- ts(daily_rentals$total_rentals, frequency = 365, start = c(year(min(daily_rentals$dteday)), yday(min(daily_rentals$dteday))))

# Step 1: Decompose the time series
decomposed <- stl(ts_data, s.window = "periodic")  # Use stl for seasonal decomposition

# Plot the decomposed components
plot(decomposed)

# Extract the seasonal component
seasonal_component <- decomposed$time.series[, "seasonal"]

# Create a stationary time series by removing the seasonal component
stationary_ts <- ts_data - seasonal_component

# Step 2: Assess stationarity

# Plot ACF and PACF
par(mfrow = c(1, 2))
acf(stationary_ts, main = "ACF of Stationary Series")
pacf(stationary_ts, main = "PACF of Stationary Series")

# Perform Augmented Dickey-Fuller test
adf_test_result <- adf.test(stationary_ts, alternative = "stationary")
print(adf_test_result)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  stationary_ts
## Dickey-Fuller = -3.7756, Lag order = 9, p-value = 0.02022
## alternative hypothesis: stationary
# Step 3: Differencing if not stationary

# Check if ADF test p-value is greater than 0.05 (not stationary)
if (adf_test_result$p.value > 0.05) {
  differenced_ts <- diff(stationary_ts)
  # Plot differenced time series
  plot(differenced_ts, main = "Differenced Time Series", ylab = "Differenced Rentals")
  
  # Recheck stationarity
  adf_test_result_diff <- adf.test(differenced_ts, alternative = "stationary")
  print(adf_test_result_diff)
} else {
  message("The time series is already stationary.")
}
## The time series is already stationary.

Explanation of Steps
Decompose the Time Series:
1. We used stl() to decompose the time series. This function allows us to view the trend, seasonal, and remainder components.
2. Plot the decomposition to visualize these components.

Assess Stationarity:
1. Plot the ACF and PACF to visually inspect the autocorrelation structure of the series.
2. Use adf.test() to statistically assess stationarity. A p-value > 0.05 indicates that the series is not stationary.

Make the Time Series Stationary:
1. If the series is not stationary, apply the diff() function to difference the series and remove trends. This often helps to stabilize the mean of the time series.
2. After differencing, recheck stationarity using the ADF test.
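As a complementary check (not part of the analysis above, and assuming `stationary_ts` from the decomposition step), `forecast::ndiffs()` estimates how many first differences a series needs by repeatedly applying a unit-root test. A minimal sketch:

```r
library(forecast)

# Estimate the number of first differences required for stationarity.
# ndiffs() applies a unit-root test (KPSS by default) repeatedly.
d <- ndiffs(stationary_ts)
cat("Suggested differencing order d =", d, "\n")

# If d > 0, difference the series d times and re-plot.
if (d > 0) {
  plot(diff(stationary_ts, differences = d),
       main = "Differenced Series", ylab = "Differenced Rentals")
}
```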

Fit and forecast time series data using ARIMA models:

To fit and forecast time series data using ARIMA models, we will follow these steps in R. This process includes fitting both manual and automatic ARIMA models, checking the residuals, and making forecasts.

Step 1: Fit ARIMA Models
We will fit manual ARIMA models using the arima() function and automatic models using auto.arima() from the forecast package.

Step 2: Check Residuals
After fitting the models, we will check the residuals to ensure they are normally distributed and uncorrelated.

Step 3: Forecasting
We will use the forecast() function to predict future values and compare the results of the manual and automatic models.

# Create a time series object
ts_data <- ts(daily_rentals$total_rentals, frequency = 365, start = c(year(min(daily_rentals$dteday)), yday(min(daily_rentals$dteday))))

# Step 1: Fit Manual ARIMA Model
manual_arima_model <- arima(ts_data, order = c(1, 1, 1))  # Change (p,d,q) as necessary
summary(manual_arima_model)
## 
## Call:
## arima(x = ts_data, order = c(1, 1, 1))
## 
## Coefficients:
##          ar1      ma1
##       0.3584  -0.8896
## s.e.  0.0423   0.0192
## 
## sigma^2 estimated as 855419:  log likelihood = -6021.96,  aic = 12049.91
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 12.87842 924.2558 647.6408 -44.46123 58.52916 0.8873291 0.01117124
# Step 2: Fit Automatic ARIMA Model
auto_arima_model <- auto.arima(ts_data)
summary(auto_arima_model)
## Series: ts_data 
## ARIMA(1,0,2)(0,1,0)[365] with drift 
## 
## Coefficients:
##          ar1      ma1      ma2   drift
##       0.9586  -0.6363  -0.1892  5.7093
## s.e.  0.0283   0.0583   0.0506  0.7566
## 
## sigma^2 = 1599566:  log likelihood = -3131.76
## AIC=6273.52   AICc=6273.68   BIC=6293.03
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 5.357072 890.0137 457.0405 -44.28372 51.73145 0.1967752 0.01047273
# Step 3: Check Residuals for manual ARIMA model
residuals_manual <- residuals(manual_arima_model)
shapiro_test_manual <- shapiro.test(residuals_manual)

print(shapiro_test_manual)
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals_manual
## W = 0.90528, p-value < 2.2e-16
# Plot ACF and PACF of residuals
par(mfrow = c(1, 2))
acf(residuals_manual, main = "ACF of Manual ARIMA Residuals")
pacf(residuals_manual, main = "PACF of Manual ARIMA Residuals")

# Check residuals for automatic ARIMA model
residuals_auto <- residuals(auto_arima_model)
shapiro_test_auto <- shapiro.test(residuals_auto)
print(shapiro_test_auto)
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals_auto
## W = 0.774, p-value < 2.2e-16
# Step 4: Forecasting
# Forecast next 25 observations with Manual ARIMA
forecast_manual <- forecast(manual_arima_model, h = 25)
plot(forecast_manual, main = "Forecast from Manual ARIMA Model")

# Forecast next 25 observations with Automatic ARIMA
forecast_auto <- forecast(auto_arima_model, h = 25)
plot(forecast_auto, main = "Forecast from Automatic ARIMA Model")

Explanation of Steps
Fit ARIMA Models:

1. Manual ARIMA: Adjust the (p, d, q) values based on your analysis of the ACF and PACF plots. The example uses (1, 1, 1), but you may want to experiment with different values.
2. Automatic ARIMA: The auto.arima() function automatically selects the best ARIMA model based on AIC/BIC criteria.

Check Residuals:
1. Use the Shapiro-Wilk test to check for normality of the residuals. A p-value > 0.05 suggests that the residuals are normally distributed.
2. Plot the ACF and PACF of the residuals to identify any patterns. Ideally, the residuals should be white noise (no significant correlations).

Forecasting:
1. Use the forecast() function to generate forecasts for the next 25 observations. This will give you both point forecasts and confidence intervals.
2. Plot the forecasts for visual analysis.
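A hedged alternative to the separate Shapiro-Wilk and ACF checks (assuming the `manual_arima_model` and `auto_arima_model` objects fitted above) is `forecast::checkresiduals()`, which combines a residual time plot, ACF, and histogram with a Ljung-Box test for remaining autocorrelation:

```r
library(forecast)

# checkresiduals() plots the residuals, their ACF and a histogram,
# and prints a Ljung-Box test for remaining autocorrelation.
checkresiduals(manual_arima_model)
checkresiduals(auto_arima_model)
```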

Comments on Forecasts
After running the code, you can compare the forecasts from both models based on their accuracy, trend patterns, and any other insights you might gather from the forecast plots. This will help you decide which model is more suitable for your data.
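One way to make that comparison concrete (a sketch, using the two model objects fitted above) is to tabulate the in-sample error measures side by side with `forecast::accuracy()`:

```r
library(forecast)

# Compare in-sample error measures (RMSE, MAE, MAPE, ...) for both models.
# Lower values generally indicate a better fit on the training data.
rbind(
  manual = accuracy(manual_arima_model),
  auto   = accuracy(auto_arima_model)
)
```

Note that in-sample accuracy favors more flexible models; a held-out test period would give a fairer out-of-sample comparison.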

Task Six: Findings and Conclusions

Throughout the process of this project, I gained valuable insights and developed a deeper understanding of the subject matter. The research journey involved thorough data analysis and interpretation, which allowed me to explore various facets of the topic. Here are some of the key components of my findings and conclusions:

Key Learnings
1. Data Handling:
I improved my skills in data manipulation and visualization. Using R and its packages enabled me to clean, transform, and visualize data effectively, reinforcing the importance of data quality in analysis.

2. Analytical Techniques:
I became familiar with different statistical methods and how to apply them to derive meaningful insights from the data. This hands-on experience solidified my understanding of concepts learned in theory.

3. Critical Thinking:
The project encouraged me to think critically about the results and their implications. I learned to question assumptions and consider alternative explanations for the patterns observed in the data.

Results vs. Expectations
The results obtained were somewhat aligned with my expectations but also presented some surprises. I anticipated certain trends based on preliminary research, yet some findings contradicted my hypotheses. This highlighted the value of data-driven insights over preconceived notions.

Key Findings and Takeaways
1. Trends and Patterns:
The analysis revealed significant trends, some of which were expected, such as the impact of temperature and season on total rentals, and some that were not immediately obvious, underscoring the importance of comprehensive data analysis in uncovering hidden insights.

2. Implications for Practice:
The findings suggest practical applications in the relevant field, which could inform future decision-making processes. This emphasizes the relevance of empirical evidence in shaping policies and strategies.

3. Future Research Directions:
This work used only the day-level data; the research could be deepened by also including the hourly data, as well as the holiday schedule data. The project identified gaps in the current literature, paving the way for further research. Future studies could build on these findings to explore additional variables or refine the analysis methods used.

In conclusion, this project not only enhanced my technical skills but also deepened my understanding of the subject matter. The experience reinforced the significance of a rigorous approach to research and the need to remain open to new insights that data may reveal.

MOHAMMAD ALI ASIF — Dated: Wed Sep 25 09:14:16 2024